pricingengine.estimation package

Submodules

pricingengine.estimation.double_ml module

class pricingengine.estimation.double_ml.DoubleML(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), treatment_builders=None, feature_builders=None, sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True)

Bases: pricingengine.estimation.double_ml.DoubleMLLikeModel

Generic Double ML Model. Estimates the coefficient \(\beta\) from the following partially linear model

\(Y = f(X) + \beta \cdot D + \epsilon\)

\(D = g(X) + \mu\)

Note that the base models are cross-fit across folds (so a model’s predictions for its training data are not used).
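The cross-fitting idea can be sketched outside the library. The example below is a minimal illustration built on scikit-learn's `LassoCV` and `KFold`, not the `DoubleML` implementation itself. The helper names `cross_fit_residuals` and `double_ml_beta` and the synthetic data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold


def cross_fit_residuals(X, target, splitter):
    """Out-of-fold residuals: each row is predicted by a model that never saw it."""
    residuals = np.empty(len(target), dtype=float)
    for train_idx, test_idx in splitter.split(X):
        model = LassoCV(cv=3).fit(X[train_idx], target[train_idx])
        residuals[test_idx] = target[test_idx] - model.predict(X[test_idx])
    return residuals


def double_ml_beta(X, D, Y, n_splits=2, seed=0):
    """Estimate beta in Y = f(X) + beta * D + eps by residual-on-residual OLS."""
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    y_res = cross_fit_residuals(X, Y, splitter)  # Y minus its cross-fit prediction from X
    d_res = cross_fit_residuals(X, D, splitter)  # D minus its cross-fit prediction g(X)
    # Second stage: OLS of outcome residuals on treatment residuals, no constant.
    return float(d_res @ y_res / (d_res @ d_res))


# Synthetic data with a known treatment effect of 2.0 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
D = X[:, 0] + rng.normal(size=2000)
Y = 3.0 * X[:, 1] + 2.0 * D + rng.normal(size=2000)
beta_hat = double_ml_beta(X, D, Y)
```

With this setup `beta_hat` should land close to the simulated effect. The second stage omits a constant to mirror the documented default `OLS(add_const=False)`.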

__init__(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), treatment_builders=None, feature_builders=None, sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True)

Initialize a new DoubleML instance.

Parameters:
  • schema – The expected schema of datasets that will be fit
  • baseline_model – An instance of a Model subclass used to compute the baseline treatment and outcome prediction models in the first stage regressions. May also be a dict that maps each column name (all treatment and outcome variables) to a corresponding Model.
  • causal_model (CausalModel) – Model to be used for computing treatment effects in second stage regression
  • error_model (Model) – Model to be used for estimating average (absolute) error size as a function of features (i.e. heteroskedasticity function)
  • feature_builders – List of VarBuilder objects used to create features for first stage regressions
  • treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
  • sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
  • cluster_date – Bool (default True) input for whether or not to cluster standard errors at the level of the date column
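To illustrate what clustering standard errors by date means, the sketch below computes a cluster-robust (sandwich) variance for an OLS fit with NumPy. It is a textbook form without any finite-sample correction. Whether the library applies a correction is not stated here, and `cluster_robust_variance` is a hypothetical helper.

```python
import numpy as np


def cluster_robust_variance(X, residuals, clusters):
    """Sandwich variance (X'X)^-1 [sum_g s_g s_g'] (X'X)^-1 with s_g = X_g' u_g."""
    X = np.asarray(X, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for label in np.unique(clusters):
        rows = clusters == label
        score = X[rows].T @ residuals[rows]  # summed score within one date
        meat += np.outer(score, score)
    return bread @ meat @ bread


# Illustrative data: 20 dates with 10 rows each.
rng = np.random.default_rng(0)
dates = np.repeat(np.arange(20), 10)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=200)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
clustered_se = np.sqrt(np.diag(cluster_robust_variance(X, resid, dates)))
```

When every row is its own cluster, this expression reduces to the usual heteroskedasticity-robust (HC0) variance.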
baseline_outcome_coefficients()

Return coefficients (averaged over splits) from the first stage outcome regression. Will account for baseline feature scaling

baseline_treatment_coefficients(treatment_name)

Get first stage coefficients (averaged over splits) from treatment regression corresponding to the given treatment_name. Will account for baseline feature scaling

error_model

Return the model used to compute predicted (absolute) error size
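As a rough picture of what an error-size model does, the sketch below regresses absolute residuals on a feature. It uses plain least squares in place of the default `LassoCV`, and the function names and data are assumptions made for illustration only.

```python
import numpy as np


def _with_intercept(features):
    features = np.asarray(features, dtype=float).reshape(len(features), -1)
    return np.column_stack([np.ones(len(features)), features])


def fit_abs_error_model(features, residuals):
    """Least-squares fit of |residual| on the features (plus an intercept)."""
    coef, *_ = np.linalg.lstsq(_with_intercept(features), np.abs(residuals), rcond=None)
    return coef


def predict_abs_error(coef, features):
    """Predicted absolute error size for new feature rows."""
    return _with_intercept(features) @ coef


# Illustrative heteroskedastic residuals whose spread grows with x.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 3.0, size=5000)
residuals = x * rng.normal(size=5000)
coef = fit_abs_error_model(x, residuals)
predicted = predict_abs_error(coef, np.array([1.0, 3.0]))
```

The predicted error size is larger at `x = 3` than at `x = 1`, which is the heteroskedasticity pattern built into the simulated residuals.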

fit_baseline_models_featurized(features, outcome, treatments, folds)

Fit first-stage baseline models (but not causal model) for predicting treatment and outcome. Sub-utility used by fit_baseline_models().

Parameters:
  • features – dictionary of features used for prediction (expects one for error too)
  • outcome – dictionary of leads mapping to series of outcome leads
  • treatments – double dictionary mapping from lead and treatment_name to series of treatment leads
  • folds – list of train test splits used for cross validation
static gen_prepredicted(df)

Converts a DataFrame of recorded predictions into a dictionary of PrePredicted models

static get_rec_df_from_csv(fname, schema)

Reads a csv file with recordings from a DoubleML prediction.

Parameters:
  • fname – filename of csv of recorded model predictions
  • schema – Schema object
Returns: DataFrame of prediction recordings

outcome_baseline_models

Return the outcome baseline models

predict_baseline(features, folds=None)
Parameters:
  • features – Either a single feature matrix or a dictionary:varname->feature matrix
  • folds
treatment_baseline_models

Return the treatment baseline models

class pricingengine.estimation.double_ml.DoubleMLLikeModel(schema, causal_model, treatment_builders, feature_builders, sample_splitter, cluster_date=True, no_constant=False)

Bases: pricingengine.estimation.regression.Estimation

An abstract base class for DoubleML-like models (DoubleML, DynamicDML, etc.)

NO_SPLIT = 'no split'
TYPE_COL_NAME = 'type'
__init__(schema, causal_model, treatment_builders, feature_builders, sample_splitter, cluster_date=True, no_constant=False)
Parameters:
  • schema – The expected schema of datasets that will be fit
  • causal_model (CausalModel) – Model to be used for computing treatment effects in second stage regression
  • treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
  • feature_builders – List of VarBuilder objects used to create features for first stage regressions
  • sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
  • cluster_date – Bool (default True) input for whether or not to cluster standard errors at the level of the date column
  • no_constant – Bool (default False) to force the construction of ConstVar treatments with all available interactions. If True, these constants are omitted.
baseline_fit_diagnostics()

Get various prediction diagnostics for all baseline (first stage) regressions

baseline_models_feat_info(avg_splits=False, combine_vars=False)

Return baseline model coefficients for all first stage models for the given lead

Parameters:
  • avg_splits (bool) – If True, averages diagnostics across model splits (otherwise returns them separately).
  • combine_vars – Try to combine the different variable vectors into a DataFrame (works if they share the same feature vector). If True, will return a single DataFrame. If False, will return a dict:varname->DataFrame (aggregated across leads)
causal_model

Return the causal model

fit_baseline_models(estimation_dataset)

Fit baseline (but not causal models) on DDML object

Parameters:estimation_dataset – EstimationDataSet object on which baseline models are fit
fit_causal_model(estimation_dataset, rm_baseline_interm_info=False, subst_treatment_builders=None)

Fit only the causal model of DDML. Requires that you have already fit baseline models.

Parameters:
  • estimation_dataset
  • rm_baseline_interm_info – If you want to fit several different causal models, pass in rm_baseline_interm_info=False
  • subst_treatment_builders – overwrites existing treatment_builders in case you want to try a different model
num_splits

Number of splits for cross-fitting

pricingengine.estimation.dynamic_dml module

class pricingengine.estimation.dynamic_dml.BaseAndError(leads)

Bases: object

Model that will fit and predict baseline models and error

__init__(leads)
baseline_fit_diagnostics()
baseline_models_feat_info(avg_splits=False, combine_vars=False)
error_model_predict(features_fit)
fit_baseline_models_featurized(common_features, lead_features, outcome_lead, treatments_lead, folds)
fit_error_model(features_fit, err)
static gen_prepredicted(df)
predict_baseline(common_features, lead_features, fold_fit_info)
class pricingengine.estimation.dynamic_dml.DDMLOptions(min_lead=1, max_lead=1)

Bases: object

Options for computing effects using Dynamic DoubleML

__init__(min_lead=1, max_lead=1)

Create a new DDMLOptions instance.

Parameters:
  • min_lead – Smallest lead to model
  • max_lead – Largest lead to model
leads

Return list where each element is a number of periods ahead to compute effects
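One plausible reading of this property is sketched below. It assumes the list covers every integer from `min_lead` to `max_lead` inclusive, which the text above does not confirm. `leads_between` is a hypothetical stand-in, not the library's code.

```python
def leads_between(min_lead=1, max_lead=1):
    """Hypothetical helper: every lead from min_lead to max_lead, inclusive."""
    if min_lead > max_lead:
        raise ValueError("min_lead must not exceed max_lead")
    return list(range(min_lead, max_lead + 1))


default_leads = leads_between()      # the documented defaults, min_lead=1 and max_lead=1
three_leads = leads_between(1, 3)
```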

class pricingengine.estimation.dynamic_dml.DynamicDML(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), feature_builders=None, treatment_builders=None, training_filter=None, options=DDMLOptions(1, 1), outcome_model_type='level', treatment_diff_models=[], sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True, cv_structure_fn=None, multi_task=False, no_constant=False)

Bases: pricingengine.estimation.double_ml.DoubleMLLikeModel

A series of DoubleML models, each lead contains a separate first stage model that corresponds to forecasting the outcome at a given lead. There is also a common causal_model that corresponds to causal impacts of treatments which are jointly learned from all models.
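To make "forecasting the outcome at a given lead" concrete, the sketch below builds one target column per lead by shifting the outcome within each panel using pandas. The column names and the helper `add_lead_targets` are assumptions for illustration. How the library constructs its lead targets internally is not documented here.

```python
import pandas as pd


def add_lead_targets(df, panel_col, date_col, outcome_col, leads):
    """Add one column per lead holding the outcome `lead` periods ahead within each panel."""
    out = df.sort_values([panel_col, date_col]).copy()
    for lead in leads:
        shifted = out.groupby(panel_col)[outcome_col].shift(-lead)
        out[f"{outcome_col}_lead{lead}"] = shifted
    return out


# Two illustrative panels observed weekly.
df = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2020-01-06", "2020-01-13", "2020-01-20"] * 2),
    "units": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
})
with_leads = add_lead_targets(df, "item", "date", "units", leads=[1, 2])
```

Rows near the end of each panel have no future observation, so their lead targets are missing and would be excluded from that lead's first stage fit.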

LEAD_LEVEL_NAME = 'lead'
__init__(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), feature_builders=None, treatment_builders=None, training_filter=None, options=DDMLOptions(1, 1), outcome_model_type='level', treatment_diff_models=[], sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True, cv_structure_fn=None, multi_task=False, no_constant=False)

Create a new instance of an effect model.

Parameters:
  • schema (Schema) – The schema for subsequent training and prediction data
  • baseline_model – An instance of a Model subclass used to compute the baseline treatment and outcome prediction models. May also be a dict that maps each column name to a corresponding Model.
  • causal_model (CausalModel) – Model to be used for computing treatment effects
  • error_model (Model) – Model to be used for estimating average (absolute) error size as a function of features (i.e. heteroskedasticity function)
  • feature_builders – List of VarBuilder objects used to create features for first stage regressions
  • treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
  • training_filter – function that takes the feature generator and estimation_dataset and returns a vector of bools which indicates which observations should be used for training. Default is all observations.
  • options (DDMLOptions) – Model options
  • outcome_model_type – FeatureGenerator.LEVEL_MODEL (default) trains first stage model in levels. FeatureGenerator.DIFF_MODEL trains first stage outcome model on first differences
  • treatment_diff_models – list of treatments that are estimated in first differences (Default is LEVEL)
  • sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
  • cluster_date – Bool (default True) input for whether or not to cluster standard errors at the level of the date column
  • cv_structure_fn – function that takes the df multiindex and returns labelled structure. Used by either GroupKFold or StratifiedKFold. Default is to use the time variable.
  • multi_task – Bool (default False) to indicate whether one instance of the specified baseline model will be used to make predictions for multiple leads. (Only certain models have this capability, for instance CNTKCausalModel). If False, then N copies of the model will be used to model the outcome for each lead.
  • no_constant – Bool (default False) to force the construction of ConstVar treatments with all available interactions. If True, these constants are omitted.
static gen_prepredicted_baselines(df, base_error_class=None)

Converts a DataFrame of recorded predictions into a dictionary of PrePredicted models

get_design_matrices(dataset)

Gets all the design (and related) matrices from all the stages.

Parameters:dataset – Needs to have the same schema as the estimation dataset but can be much smaller
Returns: Tuple of baseline_variables, baseline_features, train_fold, causal_outcomes, causal_treatments.

The first two are nested dictionaries of lead->varname->data. The inner datasets of the first two, and train_fold, all have the same row index, so they can be concatenated. The final two datasets are those of the causal regression. The causal variables will be the original values (possibly scaled) rather than residuals. For the error model, query the first two with Model.ERROR_VAR_NAME (Note: folds are meaningless here).
get_diffed_vars()
get_marginal_effects(treatment_name, competition_col, leads=None, filter_dic=None)
static get_rec_df_from_csv(fname, schema)

Reads a csv file with recordings from a DynamicDML prediction.

Parameters:
  • fname – filename of csv of recorded model predictions
  • schema – Schema object
Returns: DataFrame of prediction recordings

options

Return the options given during initialization

outcome_coefficients(lead)

Get first stage coefficients (averaged over splits) from outcome regression corresponding to the given lead. Will account for baseline feature scaling

Parameters:lead – integer corresponding to preferred lead
static translate_prediction_to_rec(pred_df, date_col, exp_ind=True)

Takes back the targets according to lead (because in fitting they are lagged to the information date)

static translate_rec_to_prediction(rec_df, leads, date_col)

Advances the targets according to lead (in the fitting they were lagged to the information date) and then averages across the folds of the model.

treatment_coefficients(lead, treatment_name)

Get first stage coefficients (averaged over splits) from treatment regression corresponding to the given lead and treatment_name. Will account for baseline feature scaling

Parameters:
  • treatment_name (str) – name of treatment variable
  • lead (int) – preferred lead
class pricingengine.estimation.dynamic_dml.MultiTaskBaseAndError(schema, baseline_model, causal_model, error_model, n_splits, leads)

Bases: pricingengine.estimation.dynamic_dml.BaseAndError

__init__(schema, baseline_model, causal_model, error_model, n_splits, leads)
baseline_fit_diagnostics()
baseline_models_feat_info(avg_splits=False, combine_vars=False)
error_model_predict(features_fit)
fit_baseline_models_featurized(common_features, lead_features, outcome_lead, treatments_lead, folds)
Parameters:
  • common_features – features common to all leads
  • lead_features – lead-specific features
  • outcome_lead – dict mapping lead to outcome variable values
  • treatments_lead – dict mapping lead to treatment variable values
  • folds – list of train test splits used for cross validation
fit_error_model(features_fit, err)
static gen_prepredicted(df)
outcome_coefficients(lead)
predict_baseline(common_features, lead_features, fold_fit_info)
Parameters:
  • common_features – features common to all leads
  • lead_features – lead-specific features
  • fold_fit_info – same data format as folds variable in fit_baseline_models_featurized
treatment_coefficients(lead, treatment_name)
class pricingengine.estimation.dynamic_dml.N_Split(n_splits)

Bases: object

__init__(n_splits)

pricingengine.estimation.estimation_dataset module

class pricingengine.estimation.estimation_dataset.EstimationDataSet(data, schema, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}), fold_fit_info=None)

Bases: pricingengine.estimation.typed_dataset.TypedDataSet

Dataset with known schema used for generating features

__init__(data, schema, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}), fold_fit_info=None)
Parameters:
  • data – pandas dataframe containing date, units, and price columns with a single column index
  • schema – Schema describing the data
  • validators – A list of validators to use for verifying data integrity
  • fold_fit_info – Series where each value is the index of the model that has this as the test portion or NaN if all folds can have this as test (when fit on subset of dataset). This is None if this dataset has been fit. Will be set after fit. We store this rather than folds since we can filter more easily.
append_data_one_instance(panel_dic, treatments_path, start_date)

Returns a new estimation_dataset object with additional rows corresponding to the instance specified in panel_dic and the given treatments_path. The synthetic data will begin on the start_date and carry forward at the same intervals as the rest of the data. If necessary, it will overwrite pre-existing data.

Parameters:
  • panel_dic – dictionary of panel values that must specify a unique instance
  • treatments_path – dictionary (keyed by treatment_names) with values as iterables of numbers specifying the planned treatments of that instance week-by-week going forward. The 0th value of each iterable corresponds to the start_date
  • start_date – First week in which the treatments_path is applied, i.e. the 0th value of each iterable in treatments_path specifies the treatment on the start_date. This value must be in the estimation_dataset or immediately following an observation in the estimation_dataset.
static convert_folds_across_indexes(orig_folds, orig_idx, new_idx)

Converts fold info from one DataFrame index to another
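A minimal sketch of the idea, assuming fold info is stored per row in the Series form described for fold_fit_info (a fold id per row, NaN where no fold applies). The function `convert_fold_info` and its argument handling are hypothetical and may differ from the library's actual representation.

```python
import numpy as np
import pandas as pd


def convert_fold_info(orig_folds, orig_idx, new_idx):
    """Re-express per-row fold ids on new_idx; rows absent from orig_idx get NaN."""
    fold_series = pd.Series(np.asarray(orig_folds, dtype=float), index=orig_idx)
    return fold_series.reindex(new_idx)


orig_idx = pd.Index(["r0", "r1", "r2", "r3"])
orig_folds = [0, 1, 0, 1]                   # test-fold id for each original row
new_idx = pd.Index(["r3", "r1", "r9"])      # reordered subset plus one unseen row
converted = convert_fold_info(orig_folds, orig_idx, new_idx)
```

Here `r9` has no entry in the original index, so it receives NaN, matching the "all folds can have this as test" convention described for fold_fit_info.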

data

Returns data

data_interval

Return the temporal spacing between consecutive data points

filter(filter_dic=None, first_date=None, last_date=None)

Returns a new estimation_dataset object which is filtered by the requirements in the filter_dic

Parameters:
  • filter_dic – dictionary mapping data columns to lists of allowed values
  • first_date – omit any data before this date
  • last_date – omit any data from after this date
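The filtering semantics described above can be sketched on a plain pandas DataFrame. This is an illustration of the three arguments, not the method's implementation. `filter_frame` and the explicit `date_col` argument are assumptions, since the real method takes the date column from the schema.

```python
import pandas as pd


def filter_frame(df, date_col, filter_dic=None, first_date=None, last_date=None):
    """Keep rows whose columns take allowed values and whose date lies in the window."""
    mask = pd.Series(True, index=df.index)
    for col, allowed in (filter_dic or {}).items():
        mask &= df[col].isin(allowed)
    if first_date is not None:
        mask &= df[date_col] >= pd.Timestamp(first_date)  # omit data before first_date
    if last_date is not None:
        mask &= df[date_col] <= pd.Timestamp(last_date)   # omit data after last_date
    return df[mask]


df = pd.DataFrame({
    "item": ["a", "a", "b", "b"],
    "date": pd.to_datetime(["2020-01-06", "2020-01-13", "2020-01-06", "2020-01-13"]),
    "units": [10.0, 11.0, 20.0, 21.0],
})
kept = filter_frame(df, "date", filter_dic={"item": ["a"]}, first_date="2020-01-13")
```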
fold_fit_info

Returns the folds (test part at least) for what was fit

static from_df(df, treatment_colname='treatment', outcome_colname='units', date_colname='date', is_panel_col=<function EstimationDataSet.<lambda>>, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}))

Create an EstimationDataSet from the given dataframe.

Parameters:
  • df

    A pandas dataframe containing price, units, and date columns.

    • String columns and panel columns will be interpreted as categorical columns
    • Float/int columns will be interpreted as numeric columns (convert numeric columns to string if
      the column is to be interpreted as categorical)
  • treatment_colname – The name of the numeric column containing treatment
  • outcome_colname – The name of the numeric column containing the outcome (quantities)
  • date_colname – The name of the datetime column containing dates
  • is_panel_col – A function that takes in a column name and returns a boolean indicating if the column is used to break the dataset into panels
  • validators – list of validators applied to the produced EstimationDataSet
gen_folds_for_new_index(new_idx)

Converts fold info from this object’s index to another

schema

Returns schema

set_folds_from_other_index(other_folds, other_idx)

Sets this object’s fold info to that from another context (fold_info and index)

pricingengine.estimation.regression module

class pricingengine.estimation.regression.Estimation(schema, cluster_date)

Bases: object

__init__(schema, cluster_date)
fit(estimation_dataset)

Fit baseline and causal models on the given dataset

Parameters:estimation_dataset (EstimationDataSet) – A dataset on which to train the model
get_coefficients(human_index=True)

Get coefficients from the causal model

Parameters:human_index – If True, then the interaction levels of the multiindex are squashed. Otherwise, they are left separate (useful for automated post-processing).
get_standard_errors(human_index=True)

Get standard errors from the causal model

get_variance_matrix(human_index=True)

Get variance matrix from the causal model
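Under the usual convention, standard errors are the square roots of the diagonal of the coefficient variance matrix. The snippet below shows that relationship on an illustrative matrix. It assumes this convention holds for the two methods above, which the text does not state explicitly.

```python
import numpy as np

# Illustrative 2x2 coefficient variance matrix (not library output).
variance = np.array([[0.04, 0.01],
                     [0.01, 0.09]])
standard_errors = np.sqrt(np.diag(variance))
```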

predict(dataset, ret_pred=None)

Compute predictions for the given dataset using previously trained model

Parameters:
  • dataset (EstimationDataSet) – A dataset containing features from which to generate predictions. The schema of the dataset must match the schema of the dataset used to fit the model.
  • ret_pred – Pass in an empty dataframe if you want that dataframe to be populated with predictions of the first stage models
Raises:
  • ValueError – If the schema of the given dataset does not match the schema given for initialization
  • RuntimeError – If the model has not yet been fit
class pricingengine.estimation.regression.Regression(schema, model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), regressor_builders=None, cluster_date=True)

Bases: pricingengine.estimation.regression.Estimation

Class for implementing estimation with VarBuilders

__init__(schema, model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), regressor_builders=None, cluster_date=True)

Initialize a new Regression instance.

Parameters:
  • schema – The expected schema of datasets to be fit and transformed
  • model (Model) – used for estimation
  • error_model (Model) – model used to estimate average absolute error (i.e. heteroskedasticity function)
  • regressor_builders – List of VarBuilders used to create regressors
error_model

Return the error model

model

Return the causal model

pricingengine.estimation.typed_dataset module

class pricingengine.estimation.typed_dataset.ColType

Bases: enum.Enum

Input IDs for known pricing data column content

OUTCOME = 10
OUTCOME_RESIDUAL = 12
PREDETERMINED = 13

A description of the data contained in a single column

  • The column tagged as ColType.ITEM must have DataType.CATEGORICAL
  • The column tagged as ColType.OUTCOME must have DataType.NUMERIC
  • The column tagged as ColType.TREATMENT must have DataType.NUMERIC
TREATMENT = 9
TREATMENT_RESIDUAL = 11
class pricingengine.estimation.typed_dataset.TypedDataSet(data, schema, required_types)

Bases: pricingengine.dataset.DataSet

Dataset class

__init__(data, schema, required_types)

Initializes a new instance of the DataSet class. The DataSet class combines time series data with a schema that specifies the column meta-data for the given time series data.

The given data-schema pair needs to adhere to the following expectations:

  • Each column defined in the given schema must be contained in the corresponding given time series data

  • Each column must have a data type corresponding to its schema DataType as follows:

    • DataType.NUMERIC: integer or floating-point
    • DataType.DATE_TIME: datetime
    • DataType.CATEGORICAL: string or integer
  • In the specified schema, the name of the column with id ITEM must also be included in the list of panel column names
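The data type expectations above can be checked on a DataFrame with pandas' dtype helpers. The sketch below uses plain strings as stand-ins for the DataType members and a hypothetical `schema_problems` function. It does not use the library's Schema class.

```python
import pandas as pd
from pandas.api import types as ptypes


def _is_categorical(series):
    # Strings (object or string dtype) or integers, per the list above.
    return (ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series)
            or ptypes.is_integer_dtype(series))


# Hypothetical string stand-ins for the DataType members named above.
DTYPE_CHECKS = {
    "NUMERIC": ptypes.is_numeric_dtype,
    "DATE_TIME": ptypes.is_datetime64_any_dtype,
    "CATEGORICAL": _is_categorical,
}


def schema_problems(df, column_types):
    """Return human-readable problems; an empty list means the frame conforms."""
    problems = []
    for col, kind in column_types.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not DTYPE_CHECKS[kind](df[col]):
            problems.append(f"{col}: expected {kind}, found {df[col].dtype}")
    return problems


df = pd.DataFrame({
    "item": ["a", "b"],
    "date": pd.to_datetime(["2020-01-06", "2020-01-13"]),
    "units": [10.0, 11.0],
})
column_types = {"item": "CATEGORICAL", "date": "DATE_TIME", "units": "NUMERIC"}
problems = schema_problems(df, column_types)
```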

Parameters:
  • data – The time series data to be used for computing effects
  • schema – The schema specifying the meta-data for the time series
group_labels

A list parallel to the rows of the dataset with a label for each row. The labels can be passed to a Pandas groupby call to group data using known groups.
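A short pandas illustration of grouping by a list of row-parallel labels. The DataFrame and labels here are made up for the example. Only the `groupby` usage pattern is the point.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"units": [10.0, 12.0, 20.0, 24.0]})
group_labels = ["a", "a", "b", "b"]     # one label per row, in row order

# An array with one entry per row is used directly as the grouping key.
mean_units = df.groupby(np.asarray(group_labels))["units"].mean()
```

Converting the list to an array avoids any ambiguity with pandas interpreting a list of strings as column names.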

Module contents